Syntactic processing of the IPI PAN Corpus of Polish
Abstract
The aim of this paper is to present recent and ongoing work on adorning the IPI PAN Corpus of Polish (Przepiórkowski 2004, 2006a) with partial syntactic annotation, with the ultimate goal of building a treebank of Polish. The work described here is part of the project Automatic extraction of linguistic knowledge from a large corpus of Polish (Ministry of Education and Science grant number 3T11C00328), which aims at the automatic construction of a valence dictionary.
Similar resources
Building a Morphosyntactic Lexicon and a Pre-syntactic Processing Chain for Polish
This paper introduces a new set of tools and resources for Polish which cover all the steps required to transform a raw unrestricted text into a reasonable input for a parser. This includes (1) a large-coverage morphological lexicon, developed thanks to the IPI PAN corpus as well as a lexical acquisition technique, and (2) multiple tools for spelling correction, segmentation, tokenization and na...
On Heads and Coordination in Valence Acquisition
The aim of this paper is to present the design of a partial syntactic annotation of the IPI PAN Corpus of Polish [22] and the corresponding extension of the corpus search engine Poliqarp [25,12] developed at the Institute of Computer Science PAS and currently employed in Polish and Portuguese corpora projects. In particular, we will argue for the need to distinguish between, and represent both, ...
Corpus, Medical Text, Annotation Morpho-syntactic Tagging, Natural Language Processing Corpus of Medical Texts and Tools
There is only one large corpus of Polish annotated with morpho-syntactic information, namely the IPI PAN Corpus (IPIC). This situation is a major obstacle to the creation of natural language processing tools dedicated to the domain of medical texts. However, real-life medical texts exhibit features that make them very distinct from most of the texts stored in IPIC. In the paper, the attempts...
An Implementation of Combined Partial Parser and Morphosyntactic Disambiguator
The aim of this paper is to present a simple yet efficient implementation of a tool for simultaneous rule-based morphosyntactic tagging and partial parsing. The parser is currently used for creating a treebank of partial parses in a valency acquisition project over the IPI PAN Corpus of Polish.
A Rule-Based Tagger for Polish Based on Genetic Algorithm
In the paper, an approach to the construction of a rule-based morphosyntactic tagger for Polish is proposed. The core of the tagger consists of modules of rules (classification systems), acquired from the IPI PAN corpus by application of Genetic Algorithms. Each module is specialised in making decisions concerning different parts of a tag (a structure of attributes). The acquired rules are combined with l...
Publication date: 2007